Using a student model to improve a computer tutor’s speech recognition
Abstract
Intelligent computer tutors can derive much of their power from having a student model that describes the learner's competencies. However, constructing a student model is challenging for computer tutors that use automated speech recognition (ASR) as input. This paper reports using ASR output from a computer tutor for reading to compare two models of how students learn to read words: a model that assumes students learn words as whole-unit chunks, and a model that assumes students learn the individual letter-sound mappings that make up words. We use the data collected by the ASR to show that a model of letter-sound mappings better describes student performance. We then compare using the student model and the ASR, both alone and in combination, to predict which words the student will read correctly, as scored by a human transcriber. Surprisingly, majority class has a higher classification accuracy than the ASR. However, we demonstrate that the ASR output still contains useful information, that classification accuracy is not a good metric for this task, and that the Area Under Curve (AUC) of the ROC curve is a superior scoring method. The AUC of the student model is statistically reliably better (0.670 vs. 0.550) than that of the ASR, which in turn is reliably better than majority class. These results show that ASR can be used to compare theories of how students learn to read words, and that modeling individual learners' proficiencies may enable improved speech recognition.

1 Motivation and Introduction

Intelligent Tutoring Systems (ITS) derive much of their power from having a student model [1] that describes the learner's proficiencies at various aspects of the domain to be learned. For example, the student model can be used to determine what feedback to give [2] or to have students practice a particular skill until it is mastered [3]. Unfortunately, language tutors have difficulty in developing strong models of the student. Much of the difficulty comes from the inaccuracies inherent in automated speech recognition (ASR). Providing explicit feedback based only on student performance on one attempt at reading a word is not viable, since the accuracy at distinguishing correct from incorrect reading is not high enough [4]. Due to such problems, student modeling has not received as much attention in computer-assisted language learning systems as in classic ITS [5], although there are exceptions such as [6].

A common approach to developing cognitive models for use in an ITS is to use think-aloud protocols [7, 8]. In a think-aloud study [7], participants verbalize their thinking while solving a problem. Such verbalizations are then used to construct a cognitive model of how the participants were solving the task. This approach has also been used to develop cognitive models for ITS [8]. Unfortunately, due to the speed of the reading process, think-aloud methodology is not well suited to modeling reading.

There have been efforts to develop cognitive models that describe the reading process. For example, [9] developed a parallel distributed processing model that was able to simulate many aspects of human performance. A major drawback of this approach is that the models are designed for individual word reading, not for reading connected text. Furthermore, rather than observing the reader's behavior with each word to model that particular reader, these studies use simulated input to try to mimic known human behavioral characteristics.
The goal of this paper is first to briefly compare two models of how children learn to read, and then to use the better model to improve the ability of the ASR to listen accurately to children. We first describe our approach to collecting and representing our data, and describe two candidate models of children's reading. We then compare which model better fits student performance as scored by the ASR. Finally, to determine whether the student model can improve listening accuracy, we compare the effects of combining the student model and the ASR to better predict how a human transcriber judges words as read correctly or incorrectly.

2 Approach to Constructing the Student Model

In this section we discuss the data used for our experiments, our statistical framework for modeling, and the two models of reading we are investigating.

2.1 Data collected and representation

We collected data from 541 students working with a computer tutor that helps children learn how to read. Over the course of the school year, these students read approximately 4.1 million words (as heard by the ASR). The tutor presented one sentence (or fragment) at a time and asked the student to read it aloud. The student's speech was segmented into utterances that ended when the student stopped speaking. Each utterance was processed by the ASR and aligned against the sentence. This alignment scores each word of the sentence as either accepted (heard by the ASR as read correctly), rejected (the ASR heard and aligned some other word), or skipped (not read by the student). We use the terms "accepted" and "rejected" rather than "correct" and "incorrect" due to inaccuracies in the ASR: it notices only about 25% of student misreadings, and scores as incorrectly read about 4% of words that were read correctly. Therefore "accept" and "reject" are more accurate terms.

One problem is determining how to score each word in the sentence text. As an example, suppose the student is trying to read the sentence "They are formed over millions of years and once depleted will take millions of years to replenish," misreads "depleted," and stops reading after "will." Clearly the word "depleted" was read incorrectly, but what about the words "take" through "replenish"? It is odd to score these words as incorrect, since the student did not try to read them. However, the student stopped reading the sentence for some reason. Since his true reason for stopping is unknown, we assume the student had difficulty with the next word in the sentence after the point where he stopped reading. So in the above example, the student would be considered to have misread "take." Our heuristic for scoring the sentence words was as follows (a code sketch follows the list):

1. For each utterance:
   a. Start = position of the first accepted word.
   b. End = 1 + position of the last accepted word.
   c. Use the ASR's accept/reject decision to score all words from Start through End as correctly or incorrectly read.
   d. Even if the ASR accepted a word, if the student hesitated more than 300 ms before it, score that word as incorrect.
2. For each sentence word w:
   a. Find the first utterance where w's position is between Start and End.
   b. Use the ASR's score for w from that utterance. If nothing is aligned against w, score it as incorrectly read.
   c. If the student requested help on w before it was accepted by the ASR, mark it as incorrectly read.
   d. If w is not contained within any utterance, it is not scored, since the student did not attempt to read the word.
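To make the heuristic concrete, here is a minimal sketch in Python. The AlignedWord record, its field names, and the function score_sentence are illustrative assumptions about how the ASR alignment might be represented, not the tutor's actual data structures; only the scoring logic follows the numbered steps above.

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    """One sentence word as aligned within one utterance (hypothetical record)."""
    position: int         # index of the word in the sentence text
    accepted: bool        # the ASR's accept/reject decision
    hesitation_ms: float  # pause before the student produced the word
    help_requested: bool  # student asked for help before the word was accepted

def score_sentence(sentence_len, utterances):
    """Score each sentence word as True (correct), False (incorrect),
    or None (not attempted), per the heuristic above."""
    # Step 1: compute each utterance's scoring window [Start, End].
    spans = []
    for utt in utterances:
        accepted = [w.position for w in utt if w.accepted]
        if not accepted:
            continue
        start, end = min(accepted), max(accepted) + 1  # End = 1 + last accepted
        spans.append((start, end, {w.position: w for w in utt}))

    # Step 2: the first utterance whose window covers a word determines its score.
    scores = [None] * sentence_len
    for pos in range(sentence_len):
        for start, end, aligned in spans:
            if start <= pos <= end:
                w = aligned.get(pos)
                if w is None:
                    scores[pos] = False  # 2b: nothing aligned against w
                elif w.help_requested:
                    scores[pos] = False  # 2c: help requested before acceptance
                else:
                    # 1c/1d: accepted counts as correct only without a long hesitation
                    scores[pos] = w.accepted and w.hesitation_ms <= 300
                break  # 2a: only the first covering utterance counts
    return scores  # words never covered remain None (2d: not attempted)
```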
To continue the above example, if the student's second attempt at reading the sentence consisted of "take millions of years to replenish," then all of the sentence words would be accepted as read correctly except for "depleted" (since it was misread) and "take" (since, in the first utterance containing this sentence word, nothing was aligned against it). After using this methodology to combine utterances, and removing students who were not part of the official study, we were left with 360 students and 1.95 million sentence words that students attempted to read. On average, students used the tutor for 8.5 hours. Most students were between six and eight years old and had reading skills appropriate for their age.

2.2 Knowledge tracing

Now that we have determined how to score student attempts at reading a word as correct or incorrect, we must map those overt actions to some internal representation of the student's knowledge. Prior work in this area [10] has shown that knowledge tracing [3] is an effective approach for using ASR output to model students. The goal of knowledge tracing is to map observable student actions while performing a skill (whether the student's response is correct or incorrect) to internal knowledge states (whether the student knows the skill or not).

As illustrated in Figure 1, knowledge tracing maintains four constant parameters for each skill. Two parameters, L0 and t, are called learning parameters and refer to the student's initial knowledge and to the probability of learning a skill given an opportunity to apply it, respectively. Two parameters, slip and guess, are called performance parameters and are used to account for student performance not being a perfect reflection of underlying knowledge. The guess parameter is the probability that a student who has not mastered the skill can nevertheless generate a correct response. For example, on a multiple-choice test with four response choices, a student with no knowledge still has a 25% chance of getting the question correct. The slip parameter accounts for even knowledgeable students making an occasional mistake. For example, a student asked to multiply 4 and 3 could accidentally hit the keys in the wrong order and type "21."

Figure 1. Overview of knowledge tracing (the diagram relates P(L0), P(Knows Skill), P(guess), 1 − P(guess), and P(slip))

For each student and for each skill, knowledge tracing maintains the probability that the student knows the skill. Knowledge tracing updates its estimate of P(knows) based on student performance: whenever the student has an opportunity to apply the skill, observe whether he performed it correctly or incorrectly.
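The update itself is not spelled out above, but it is the standard knowledge-tracing computation [3]: a Bayesian revision of P(knows) given the observed response, followed by a chance t of learning at that opportunity. Below is a minimal sketch; the function name and the example parameter values are illustrative assumptions, not values fitted in this paper.

```python
def kt_update(p_know, correct, guess, slip, t):
    """One knowledge-tracing step: Bayesian update of P(knows) given the
    observed response, then apply the learning parameter t."""
    if correct:
        # P(knew the skill | correct response): a correct answer is either
        # true knowledge without a slip, or a lucky guess.
        num = p_know * (1 - slip)
        den = num + (1 - p_know) * guess
    else:
        # P(knew the skill | incorrect response): an error is either a slip
        # by a knowledgeable student, or a failed guess.
        num = p_know * slip
        den = num + (1 - p_know) * (1 - guess)
    p_given_obs = num / den
    # Probability t of learning the skill at this opportunity to apply it.
    return p_given_obs + (1 - p_given_obs) * t

# Example with illustrative parameters: starting from L0 = 0.3, one correct
# response with guess = 0.25, slip = 0.1, t = 0.1 yields roughly 0.65.
p = kt_update(0.3, correct=True, guess=0.25, slip=0.1, t=0.1)
```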